-
Notifications
You must be signed in to change notification settings - Fork 264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: integrate with pgvector #1153
Conversation
If the native database supports vector indexing, I think we should push down it to the native database system. This can be an optimizer rule. If the native database does not support vector indexing, we will do it ourselves. |
Why are we not pushing the |
Based on offline discussion closing this PR |
I reopen this PR because the index scan pass is also implemented in this branch. This PR adds a feature to allow users to create an index on a table using Create index query.CREATE INDEX test_index ON test_data_source.test_vector (embedding)
USING PGVECTOR When users attempt to create an index in Index scan.SELECT idx, embedding FROM test_data_source.test_vector
ORDER BY Similarity(DummyFeatureExtractor(Open(...)), embedding)
LIMIT 1 I take some shortcuts in the optimizer that if the data source is from |
db_catalog_entry.engine, **db_catalog_entry.params | ||
) as handler: | ||
columns = table.table_obj.columns | ||
# As other libraries, we default to HNSW and L2 distance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does other libraries mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant other vector store types that we currently support (e.g., FAISS).
evadb/binder/statement_binder.py
Outdated
), "Index can only be created on an existing table" | ||
|
||
# Vector type specific check. | ||
catalog = self._catalog() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder to move it outside
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gaurav274 Added some changes. Can you give it a read? Not sure shall we still stick with the singledispatchmethod
approach or do subclass inheritance and add some if statement like the Executor
.
Overall design looks good to me. We may consider the gain/loss of moving some of the push down to optimizer. But I don't think that is the priority now. Left some comments for clarification. |
I will need some feedback on the design here. @xzdandy @gaurav274
This PR reflects my initial design for in-database index features like pgvector. My initial idea is to offload as much as possible to the native database. When we create an index and do the index scan, we simply push an emulated query to the underlying database. While this is doable, this will introduce separate implementation paths in different components including the optimizer and the executor.
Other than the current design, another option is to reuse the third-party vector integration interface. There are also a few details that I am not clear about
(Issue related to performance). Following the current create index implementation, data will be fetched from Postgres first but it is not really needed. The create index will anyway run inside Postgres.The current vector index scan is implemented based on_row_id
. When it is scanning data from Postgres, we need to figure out a way to populate_row_id
for the native database engine.